Enumerating the k closest pairs mechanically
Let S be a set of n points in D-dimensional space, where D is a constant, and let k be an integer between 1 and n(n-1)/2. An algorithm is given that computes the k closest pairs in the set S in O(n log n + k) time, using O(n + k) space. The algorithm fits in the algebraic decision tree model and is, therefore, optimal
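For orientation, the k-closest-pairs problem can be stated as a brute-force baseline. This is a minimal sketch, not the algorithm from the abstract: it examines all pairs and is therefore quadratic in the number of points. The function name `k_closest_pairs_bruteforce` is hypothetical, and Euclidean distance is an assumption.

```python
import heapq
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def k_closest_pairs_bruteforce(points, k):
    """Return the k closest pairs as (distance, p, q) tuples, nearest first.

    Reference baseline only: it enumerates all n(n-1)/2 pairs, so it is
    far slower than an optimal algorithm for large inputs.
    """
    pairs = ((dist(p, q), p, q) for p, q in combinations(points, 2))
    # Keep only the k smallest distances; the key avoids comparing point tuples.
    return heapq.nsmallest(k, pairs, key=lambda t: t[0])


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (5, 5), (1, 1)]
    for d, p, q in k_closest_pairs_bruteforce(pts, 2):
        print(d, p, q)
```

Such a baseline is mainly useful for checking a faster implementation on small inputs.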
Computational Molecular Biology
Computational Biology is a fairly new subject that arose in response to the computational problems posed by the analysis and the processing of biomolecular sequence and structure data. The field was initiated in the late 60's and early 70's largely by pioneers working in the life sciences. Physicists and mathematicians entered the field in the 70's and 80's, while Computer Science became involved with the new biological problems in the late 1980's. Computational problems have gained further importance in molecular biology through the various genome projects which produce enormous amounts of data. For this bibliography we focus on those areas of computational molecular biology that involve discrete algorithms or discrete optimization. We thus neglect several other areas of computational molecular biology, like most of the literature on the protein folding problem, as well as databases for molecular and genetic data, and genetic mapping algorithms. Due to the availability of review papers and a bibliography this bibliography
EDISON-WMW: Exact Dynamic Programing Solution of the Wilcoxon-Mann-Whitney Test
In many research disciplines, hypothesis tests are applied to evaluate whether findings are statistically significant or could be explained by chance. The Wilcoxon–Mann–Whitney (WMW) test is among the most popular hypothesis tests in medicine and life science to analyze if two groups of samples are equally distributed. This nonparametric statistical homogeneity test is commonly applied in molecular diagnosis. Generally, the solution of the WMW test takes a high combinatorial effort for large sample cohorts containing a significant number of ties. Hence, the P value is frequently approximated by a normal distribution. We developed EDISON-WMW, a new approach to calculate the exact permutation of the two-tailed unpaired WMW test without any corrections required and allowing for ties. The method relies on dynamic programing to solve the combinatorial problem of the WMW test efficiently. Beyond a straightforward implementation of the algorithm, we presented different optimization strategies and developed a parallel solution. Using our program, the exact P value for large cohorts containing more than 1000 samples with ties can be calculated within minutes. We demonstrate the performance of this novel approach on randomly generated data, benchmark it against 13 other commonly applied approaches and moreover evaluate molecular biomarkers for lung carcinoma and chronic obstructive pulmonary disease (COPD). We found that approximated P values were generally higher than the exact solution provided by EDISON-WMW. Importantly, the algorithm can also be applied to high-throughput omics datasets, where hundreds or thousands of features are included. To provide easy access to the multi-threaded version of EDISON-WMW, a web-based solution of our algorithm is freely available at http://www.ccb.uni-saarland.de/software/wtest/
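The idea of an exact, tie-aware permutation P value obtained by dynamic programming can be illustrated as follows. This is a minimal sketch under stated assumptions, not the EDISON-WMW implementation: it uses midranks for ties (doubled so that they stay integers), counts group assignments by rank sum with a knapsack-style table, and defines "two-sided" as a deviation from the expected rank sum at least as large as the observed one. The names `midranks_doubled` and `exact_wmw_two_sided` are hypothetical.

```python
from collections import Counter
from math import comb


def midranks_doubled(values):
    """Return twice the midrank of each value, so tied ranks stay integers."""
    order = sorted(values)
    count = Counter(order)
    first = {}
    for idx, v in enumerate(order):
        first.setdefault(v, idx)  # 0-based position of the first occurrence
    # 1-based midrank = first + (count + 1) / 2, hence doubled:
    return [2 * first[v] + count[v] + 1 for v in values]


def exact_wmw_two_sided(x, y):
    """Exact two-sided permutation P value of the rank-sum statistic."""
    pooled = list(x) + list(y)
    n1, n = len(x), len(pooled)
    r2 = midranks_doubled(pooled)
    observed = sum(r2[:n1])
    # dp[j][s]: number of ways to pick j pooled samples with doubled rank sum s
    dp = [Counter() for _ in range(n1 + 1)]
    dp[0][0] = 1
    for r in r2:
        # Descending j ensures each sample is used at most once.
        for j in range(n1 - 1, -1, -1):
            for s, ways in list(dp[j].items()):
                dp[j + 1][s + r] += ways
    center = n1 * (n + 1)  # expected doubled rank sum of the first group
    dev = abs(observed - center)
    extreme = sum(w for s, w in dp[n1].items() if abs(s - center) >= dev)
    return extreme / comb(n, n1)


if __name__ == "__main__":
    print(exact_wmw_two_sided([1, 2, 3], [4, 5, 6]))
```

The table grows with the number of distinct rank sums, which is why optimizations and parallelization such as those mentioned in the abstract matter for large cohorts.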
Algorithm engineering for optimal alignment of protein structure distance matrices
Protein structural alignment is an important problem in computational
biology. In this paper, we present first successes on provably optimal pairwise
alignment of protein inter-residue distance matrices, using the popular Dali
scoring function. We introduce the structural alignment problem formally, which
enables us to express a variety of scoring functions used in previous work as
special cases in a unified framework. Further, we propose the first
mathematical model for computing optimal structural alignments based on dense
inter-residue distance matrices. We therefore reformulate the problem as a
special graph problem and give a tight integer linear programming model. We
then present algorithm engineering techniques to handle the huge integer linear
programs of real-life distance matrix alignment problems. Applying these
techniques, we can compute provably optimal Dali alignments for the very first
time
Systematic permutation testing in GWAS pathway analyses: identification of genetic networks in dilated cardiomyopathy and ulcerative colitis
Background: Genome-wide association studies (GWAS) are applied to identify genetic loci that are associated with complex traits and human diseases. Analogous to the evolution of gene expression analyses, pathway analyses have emerged as important tools to uncover functional networks of genome-wide association data. Usually, pathway analyses combine statistical methods with a priori available biological knowledge. To determine significance thresholds for associated pathways, correction for multiple testing and over-representation permutation testing is applied. Results: We systematically investigated the impact of three different permutation test approaches for over-representation analysis to detect false positive pathway candidates and evaluate them on genome-wide association data of Dilated Cardiomyopathy (DCM) and Ulcerative Colitis (UC). Our results provide evidence that the gold standard, permuting the case–control status, effectively improves specificity of GWAS pathway analysis. Although permutation of SNPs does not maintain linkage disequilibrium (LD), these permutations represent an alternative for GWAS data when case–control permutations are not possible. Gene permutations, however, did not add significantly to the specificity. Finally, we provide estimates on the required number of permutations for the investigated approaches. Conclusions: To discover potential false positive functional pathway candidates and to support the results from standard statistical tests such as the Hypergeometric test, permutation tests of case–control data should be carried out. The most reasonable approach was case–control permutation; if this is not possible, SNP permutations may be carried out. Our study also demonstrates that significance values converge rapidly with an increasing number of permutations.
By applying the described statistical framework we were able to discover axon guidance, focal adhesion and calcium signaling as important DCM-related pathways and Intestinal immune network for IgA production as the most significant UC pathway
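The two ingredients named in the abstract, a hypergeometric over-representation test and a permutation-based significance estimate, can be sketched generically. This is an illustration under assumptions, not the authors' pipeline: the function names are hypothetical, the permuted statistics are assumed to come from re-running the analysis on permuted case–control labels, and the "+1" correction in the empirical P value is a common convention that the abstract does not mention.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def empirical_p(observed_stat, permuted_stats):
    """Share of permutations at least as extreme as the observed statistic.

    The +1 in numerator and denominator (an assumption here) keeps the
    estimate from being exactly zero with a finite number of permutations.
    """
    exceed = sum(1 for s in permuted_stats if s >= observed_stat)
    return (exceed + 1) / (len(permuted_stats) + 1)


if __name__ == "__main__":
    # Toy numbers: 10 genes in total, 3 in the pathway, 3 significant, all 3 hit.
    print(hypergeom_upper_tail(10, 3, 3, 3))
    print(empirical_p(5, [1, 2, 6, 7]))
```

With more permutations the empirical estimate stabilizes, which matches the abstract's remark that significance values converge as the number of permutations grows.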
Louse (Insecta : Phthiraptera) mitochondrial 12S rRNA secondary structure is highly variable
Lice are ectoparasitic insects hosted by birds and mammals. Mitochondrial 12S rRNA sequences obtained from lice show considerable length variation and are very difficult to align. We show that the louse 12S rRNA domain III secondary structure displays considerable variation compared to other insects, in both the shape and number of stems and loops. Phylogenetic trees constructed from tree edit distances between louse 12S rRNA structures do not closely resemble trees constructed from sequence data, suggesting that at least some of this structural variation has arisen independently in different louse lineages. Taken together with previous work on mitochondrial gene order and elevated rates of substitution in louse mitochondrial sequences, the structural variation in louse 12S rRNA confirms the highly distinctive nature of molecular evolution in these insects
Computation of significance scores of unweighted Gene Set Enrichment Analyses
Background: Gene Set Enrichment Analysis (GSEA) is a computational method for the statistical evaluation of sorted lists of genes or proteins. Originally GSEA was developed for interpreting microarray gene expression data, but it can be applied to any sorted list of genes. Given the gene list and an arbitrary biological category, GSEA evaluates whether the genes of the considered category are randomly distributed or accumulated at the top or bottom of the list. Usually, significance scores (p-values) of GSEA are computed by nonparametric permutation tests, a time-consuming procedure that yields only estimates of the p-values. Results: We present a novel dynamic programming algorithm for calculating exact significance values of unweighted Gene Set Enrichment Analyses. Our algorithm avoids typical problems of nonparametric permutation tests, such as varying findings in different runs caused by the random sampling procedure. Another advantage of the presented dynamic programming algorithm is its runtime and memory efficiency. To test our algorithm, we applied it not only to simulated data sets, but additionally evaluated expression profiles of squamous cell lung cancer tissue and autologous unaffected tissue.
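One way an exact p-value for an unweighted running-sum statistic can be computed by dynamic programming is sketched below. It rests on assumptions not stated in the abstract: the running sum gains (l - m) for each category gene and loses m for each other gene (l genes in total, m in the category), the test statistic is the maximum absolute running sum, and all placements of the category genes are equally likely under the null. The function names are hypothetical and this is not claimed to be the authors' exact formulation.

```python
from math import comb


def running_sum_max(flags):
    """Maximum absolute deviation of the unweighted running sum.

    flags[i] is truthy if the gene at rank i belongs to the category.
    """
    l, m = len(flags), sum(1 for f in flags if f)
    rs, best = 0, 0
    for f in flags:
        rs += (l - m) if f else -m
        best = max(best, abs(rs))
    return best


def exact_gsea_p(flags):
    """Exact P(max |running sum| >= observed) over all gene placements."""
    l, m = len(flags), sum(1 for f in flags if f)
    obs = running_sum_max(flags)
    # ok[h][q]: orderings of h category genes and q other genes whose
    # running sum stays strictly below obs in absolute value throughout.
    ok = [[0] * (l - m + 1) for _ in range(m + 1)]
    ok[0][0] = 1
    for h in range(m + 1):
        for q in range(l - m + 1):
            if h == 0 and q == 0:
                continue
            if abs(h * (l - m) - q * m) >= obs:
                continue  # this prefix already reaches the observed extreme
            ways = 0
            if h:
                ways += ok[h - 1][q]
            if q:
                ways += ok[h][q - 1]
            ok[h][q] = ways
    return 1 - ok[m][l - m] / comb(l, m)


if __name__ == "__main__":
    print(exact_gsea_p([1, 1, 0, 0]))
```

The table has (m + 1)(l - m + 1) cells, and the result is deterministic, unlike a sampled permutation estimate.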
Clinical predictors of long-term survival in newly diagnosed transplant eligible multiple myeloma - an IMWG Research Project
Purpose: multiple myeloma is considered an incurable hematologic cancer but a subset of patients can achieve long-term remissions and survival. The present study examines the clinical features of long-term survival as it correlates to depth of disease response. Patients & Methods: this was a multi-institutional, international, retrospective analysis of high-dose melphalan-autologous stem cell transplant (HDM-ASCT) eligible MM patients included in clinical trials. Clinical variable and survival data were collected from 7291 MM patients from Czech Republic, France, Germany, Italy, Korea, Spain, the Nordic Myeloma Study Group and the United States. Kaplan–Meier curves were used to assess progression-free survival (PFS) and overall survival (OS). Relative survival (RS) and statistical cure fractions (CF) were computed for all patients with available data. Results: achieving CR at 1 year was associated with superior PFS (median PFS 3.3 years vs. 2.6 years, p < 0.0001) as well as OS (median OS 8.5 years vs. 6.3 years, p < 0.0001). Clinical variables at diagnosis associated with 5-year survival and 10-year survival were compared with those associated with 2-year death. In multivariate analysis, age over 65 years (OR 1.87, p = 0.002), IgA Isotype (OR 1.53, p = 0.004), low albumin < 3.5 g/dL (OR = 1.36, p = 0.023), elevated beta 2 microglobulin ≥ 3.5 mg/dL (OR 1.86, p < 0.001), serum creatinine levels ≥ 2 mg/dL (OR 1.77, p = 0.005), hemoglobin levels < 10 g/dL (OR 1.55, p = 0.003), and platelet count < 150k/μL (OR 2.26, p < 0.001) appeared to be negatively associated with 10-year survival. The relative survival for the cohort was ~0.9, and the statistical cure fraction was 14.3%. Conclusions: these data identify CR as an important predictor of long-term survival for HDM-ASCT eligible MM patients. They also identify clinical variables reflective of higher disease burden as poor prognostic markers for long-term survival
- …